{
 "cells": [
  {
   "cell_type": "raw",
   "id": "a47da0d0-0927-4adb-93e6-99a434f732cf",
   "metadata": {},
   "source": [
    "---\n",
    "sidebar_position: 3\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2195672-0cab-4967-ba8a-c6544635547d",
   "metadata": {},
   "source": [
    "# Structuring\n",
    "\n",
    "One of the most important steps in retrieval is turning a text input into the right search and filter parameters. This process of extracting structured parameters from an unstructured input is what we refer to as **query structuring**.\n",
    "\n",
    "To illustrate, let's return to our example of a Q&A bot over the LangChain YouTube videos from the [Quickstart](/docs/use_cases/query_analysis/quickstart) and see what more complex structured queries might look like in this case."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a4079b57-4369-49c9-b2ad-c809b5408d7e",
   "metadata": {},
   "source": [
    "## Setup\n",
    "#### Install dependencies\n",
    "\n",
    "```{=mdx}\n",
    "import IntegrationInstallTooltip from \"@mdx_components/integration_install_tooltip.mdx\";\n",
    "import Npm2Yarn from \"@theme/Npm2Yarn\";\n",
    "\n",
    "<IntegrationInstallTooltip></IntegrationInstallTooltip>\n",
    "\n",
    "<Npm2Yarn>\n",
    "  @langchain/core zod\n",
    "</Npm2Yarn>\n",
    "```\n",
    "\n",
    "#### Set environment variables\n",
    "\n",
    "```\n",
    "# Optional, use LangSmith for best-in-class observability\n",
    "LANGSMITH_API_KEY=your-api-key\n",
    "LANGCHAIN_TRACING_V2=true\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c20b48b8-16d7-4089-bc17-f2d240b3935a",
   "metadata": {},
   "source": [
    "### Load example document\n",
    "\n",
    "Let's say we loaded a document with the following metadata:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae6921e1-3d5a-431c-9999-29a5f33201e1",
   "metadata": {},
   "source": [
    "```json\n",
    "{\n",
    " \"source\": \"pbAd8O1Lvm4\",\n",
    " \"title\": \"Self-reflective RAG with LangGraph: Self-RAG and CRAG\",\n",
    " \"description\": \"Unknown\",\n",
    " \"view_count\": 9006,\n",
    " \"thumbnail_url\": \"https://i.ytimg.com/vi/pbAd8O1Lvm4/hq720.jpg\",\n",
    " \"publish_date\": \"2024-02-07 00:00:00\",\n",
    " \"length\": 1058,\n",
    " \"author\": \"LangChain\"\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "57396e23-c192-4d97-846b-5eacea4d6b8d",
   "metadata": {},
   "source": [
    "## Query schema\n",
    "\n",
    "In order to generate structured queries we first need to define our query schema. We can see that each document has a title, view count, publication date, and length in seconds. Let's assume we've built an index that allows us to perform unstructured search over the contents and title of each document, and to use range filtering on view count, publication date, and length.\n",
    "\n",
    "To start we'll create a schema with explicit min and max attributes for view count, publication date, and video length so that those can be filtered on. And we'll add separate attributes for searches against the transcript contents versus the video title. \n",
    "\n",
    "We could alternatively create a more generic schema where instead of having one or more filter attributes for each filterable field, we have a single `filters` attribute that takes a list of (attribute, condition, value) tuples. We'll demonstrate how to do this as well. Which approach works best depends on the complexity of your index. If you have many filterable fields then it may be better to have a single `filters` query attribute. If you have only a few filterable fields and/or there are fields that can only be filtered in very specific ways, it can be helpful to have separate query attributes for them, each with their own description."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "0b51dd76-820d-41a4-98c8-893f6fe0d1ea",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { RunnableLambda } from '@langchain/core/runnables';\n",
    "import { z } from 'zod';\n",
    "\n",
    "const tutorialSearch = z.object({\n",
    "  content_search: z.string().describe(\"Similarity search query applied to video transcripts.\"),\n",
    "  title_search: z.string().describe(\"Alternate version of the content search query to apply to video titles. Should be succinct and only include key words that could be in a video title.\"),\n",
    "  min_view_count: z.number().optional().describe(\"Minimum view count filter, inclusive. Only use if explicitly specified.\"),\n",
    "  max_view_count: z.number().optional().describe(\"Maximum view count filter, exclusive. Only use if explicitly specified.\"),\n",
    "  earliest_publish_date: z.date().optional().describe(\"Earliest publish date filter, inclusive. Only use if explicitly specified.\"),\n",
    "  latest_publish_date: z.date().optional().describe(\"Latest publish date filter, exclusive. Only use if explicitly specified.\"),\n",
    "  min_length_sec: z.number().optional().describe(\"Minimum video length in seconds, inclusive. Only use if explicitly specified.\"),\n",
    "  max_length_sec: z.number().optional().describe(\"Maximum video length in seconds, exclusive. Only use if explicitly specified.\"),\n",
    "});\n",
    "\n",
    "const prettyPrint = (obj: z.infer<typeof tutorialSearch>) => {\n",
    "    for (const field in obj) {\n",
    "        if (obj[field] !== undefined) {\n",
    "            console.log(`${field}: ${JSON.stringify(obj[field], null, 2)}`);\n",
    "        }\n",
    "    }\n",
    "}\n",
    "\n",
    "const prettyPrintRunnable = new RunnableLambda({\n",
    "    func: prettyPrint,\n",
    "}).withConfig({ runName: 'prettyPrint' });"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8b08c52-1ce9-4d8b-a779-cbe8efde51d1",
   "metadata": {},
   "source": [
    "## Query generation\n",
    "\n",
    "To convert user questions to structured queries we'll make use of a function-calling model. LangChain has some nice constructors that make it easy to specify a desired function call schema via a Zod schema:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "640119a6",
   "metadata": {},
   "source": [
    "```{=mdx}\n",
    "import ChatModelTabs from \"@theme/ChatModelTabs\";\n",
    "\n",
    "<ChatModelTabs customVarName=\"llm\" />\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "783c03c3-8c72-4f88-9cf4-5829ce6745d6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { ChatPromptTemplate } from \"@langchain/core/prompts\";\n",
    "\n",
    "const system = `You are an expert at converting user questions into database queries.\n",
    "You have access to a database of tutorial videos about a software library for building LLM-powered applications.\n",
    "Given a question, return a database query optimized to retrieve the most relevant results.\n",
    "\n",
    "If there are acronyms or words you are not familiar with, do not try to rephrase them.`\n",
    "\n",
    "const prompt = ChatPromptTemplate.fromMessages(\n",
    "    [\n",
    "      [\"system\", system],\n",
    "      [\"human\", \"{question}\"],\n",
    "    ]\n",
    "  )\n",
    "const llmWithTools = llm.withStructuredOutput(tutorialSearch, {\n",
    "    name: \"TutorialSearch\",\n",
    "})\n",
    "const queryAnalyzer = prompt.pipe(llmWithTools);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f403517a-b8e3-44ac-b0a6-02f8305635a2",
   "metadata": {},
   "source": [
    "Let's try it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "92bc7bac-700d-4666-b523-f0f8c3644ad5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"rag from scratch\"\n",
      "title_search: \"rag from scratch\"\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke({\"question\": \"rag from scratch\"});"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "af62af17-4f90-4dbd-a8b4-dfff51f1db95",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"chat langchain\"\n",
      "title_search: \"2023\"\n",
      "earliest_publish_date: \"2023-01-01T00:00:00Z\"\n",
      "latest_publish_date: \"2024-01-01T00:00:00Z\"\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke(\n",
    "    {\"question\": \"videos on chat langchain published in 2023\"}\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "87590c6d-edd7-4805-bf68-c906907f9291",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"multi-modal models agent\"\n",
      "title_search: \"multi-modal models agent\"\n",
      "max_length_sec: 300\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke(\n",
    "    {\n",
    "        \"question\": \"how to use multi-modal models in an agent, only videos under 5 minutes\"\n",
    "    }\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b35e7ddf-ed39-4e70-a980-29a4c2d93ebd",
   "metadata": {},
   "source": [
    "## Alternative: Succinct schema\n",
    "\n",
    "If we have many filterable fields then having a verbose schema could harm performance, or may not even be possible given limitations on the size of function schemas. In these cases we can try more succinct query schemas that trade off some explicitness of direction for concision:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "81a036c0-c770-47dc-8b06-1dcfa403fdb1",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { z } from 'zod';\n",
    "\n",
    "const Filter = z.object({\n",
    "field: z.union([\n",
    "    z.literal(\"view_count\"),\n",
    "    z.literal(\"publish_date\"),\n",
    "    z.literal(\"length_sec\")\n",
    "]),\n",
    "comparison: z.union([\n",
    "    z.literal(\"eq\"),\n",
    "    z.literal(\"lt\"),\n",
    "    z.literal(\"lte\"),\n",
    "    z.literal(\"gt\"),\n",
    "    z.literal(\"gte\")\n",
    "]),\n",
    "value: z.union([\n",
    "    z.number(),\n",
    "    z.string().refine((data) => !isNaN(Date.parse(data)), {\n",
    "    message: \"If field is publish_date then value must be a ISO-8601 format date\",\n",
    "    })\n",
    "]).describe(\"If field is publish_date then value must be a ISO-8601 format date\"),\n",
    "});\n",
    "  \n",
    "const tutorialSearch = z.object({\n",
    "content_search: z.string().describe(\"Similarity search query applied to video transcripts.\"),\n",
    "title_search: z.string().describe(\n",
    "    \"Alternate version of the content search query to apply to video titles. \" +\n",
    "    \"Should be succinct and only include key words that could be in a video title.\"\n",
    "),\n",
    "filters: z.array(Filter).default([]).describe(\n",
    "    \"Filters over specific fields. Final condition is a logical conjunction of all filters.\"\n",
    "),\n",
    "});"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "a1b48f5f-34d3-4abc-a652-936c593e6186",
   "metadata": {},
   "outputs": [],
   "source": [
    "const llmWithTools = llm.withStructuredOutput(tutorialSearch, {\n",
    "  name: \"TutorialSearch\",\n",
    "})\n",
    "const queryAnalyzer = prompt.pipe(llmWithTools);"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8f91b967-1b0e-4fff-9e00-63b1ea32ab2a",
   "metadata": {},
   "source": [
    "Let's try it out:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "4fefa1ac-509d-41e8-bfa3-a0f1481d9bab",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"rag from scratch\"\n",
      "title_search: \"rag\"\n",
      "filters: []\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke({\"question\": \"rag from scratch\"});"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "81171733-c3b6-4356-8081-81a757a5daf7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"chat langchain\"\n",
      "title_search: \"chat langchain\"\n",
      "filters: [\n",
      "  {\n",
      "    \"field\": \"publish_date\",\n",
      "    \"comparison\": \"gte\",\n",
      "    \"value\": \"2023-01-01\"\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke(\n",
    "    {\"question\": \"videos on chat langchain published in 2023\"}\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "47965e1b-6c87-4dce-9791-0007aa5a6a94",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"multi-modal models in an agent\"\n",
      "title_search: \"multi-modal models\"\n",
      "filters: [\n",
      "  {\n",
      "    \"field\": \"length_sec\",\n",
      "    \"comparison\": \"lt\",\n",
      "    \"value\": 300\n",
      "  },\n",
      "  {\n",
      "    \"field\": \"view_count\",\n",
      "    \"comparison\": \"gte\",\n",
      "    \"value\": 276\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke(\n",
    "    {\n",
    "        \"question\": \"how to use multi-modal models in an agent, only videos under 5 minutes and with over 276 views\"\n",
    "    }\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49a1def0-246e-47fd-9f7f-bc5d18bcd802",
   "metadata": {},
   "source": [
    "We can see that the analyzer handles integers well but struggles with date ranges. We can try adjusting our schema description and/or our prompt to correct this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "cce5857c-8a20-4dc0-a216-5330ee567195",
   "metadata": {},
   "outputs": [],
   "source": [
    "import { z } from 'zod';\n",
    "\n",
    "const tutorialSearch = z.object({\n",
    "    content_search: z.string().describe(\"Similarity search query applied to video transcripts.\"),\n",
    "    title_search: z.string().describe(\n",
    "      \"Alternate version of the content search query to apply to video titles. \" +\n",
    "      \"Should be succinct and only include key words that could be in a video title.\"\n",
    "    ),\n",
    "    filters: z.array(Filter).default([]).describe(\n",
    "      \"Filters over specific fields. Final condition is a logical conjunction of all filters. \" +\n",
    "      \"If a time period longer than one day is specified then it must result in filters that define a date range. \" +\n",
    "      `Keep in mind the current date is ${new Date().toISOString().split('T')[0]}.`\n",
    "    ),\n",
    "  });\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "id": "8d289ea7",
   "metadata": {},
   "outputs": [],
   "source": [
    "const llmWithTools = llm.withStructuredOutput(tutorialSearch, {\n",
    "  name: \"TutorialSearch\",\n",
    "})\n",
    "const queryAnalyzer = prompt.pipe(llmWithTools);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "id": "d7e287b0-a434-49df-a12f-04369bd12679",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"chat langchain\"\n",
      "title_search: \"chat langchain\"\n",
      "filters: [\n",
      "  {\n",
      "    \"field\": \"publish_date\",\n",
      "    \"comparison\": \"eq\",\n",
      "    \"value\": \"2023\"\n",
      "  }\n",
      "]\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke(\n",
    "    {\"question\": \"videos on chat langchain published in 2023\"}\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b938f083-4690-4283-9429-070ff3f46c0b",
   "metadata": {},
   "source": [
    "This seems to work!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cc228c39-01f0-4475-b9bf-15d33033dbb7",
   "metadata": {},
   "source": [
    "## Sorting: Going beyond search\n",
    "\n",
    "With certain indexes searching by field isn't the only way to retrieve results — we can also sort documents by a field and retrieve the top sorted results. With structured querying this is easy to accomodate by adding separate query fields that specify how to sort results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "id": "2b7ec524-f625-483f-bafb-f8301fded7ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "const tutorialSearch = z.object({\n",
    "    content_search: z.string().default(\"\").describe(\"Similarity search query applied to video transcripts.\"),\n",
    "    title_search: z.string().default(\"\").describe(\n",
    "      \"Alternate version of the content search query to apply to video titles. \" +\n",
    "      \"Should be succinct and only include key words that could be in a video title.\"\n",
    "    ),\n",
    "    min_view_count: z.number().optional().describe(\"Minimum view count filter, inclusive.\"),\n",
    "    max_view_count: z.number().optional().describe(\"Maximum view count filter, exclusive.\"),\n",
    "    earliest_publish_date: z.date().optional().describe(\"Earliest publish date filter, inclusive.\"),\n",
    "    latest_publish_date: z.date().optional().describe(\"Latest publish date filter, exclusive.\"),\n",
    "    min_length_sec: z.number().optional().describe(\"Minimum video length in seconds, inclusive.\"),\n",
    "    max_length_sec: z.number().optional().describe(\"Maximum video length in seconds, exclusive.\"),\n",
    "    sort_by: z.enum([\"relevance\", \"view_count\", \"publish_date\", \"length\"]).default(\"relevance\").describe(\"Attribute to sort by.\"),\n",
    "    sort_order: z.enum([\"ascending\", \"descending\"]).default(\"descending\").describe(\"Whether to sort in ascending or descending order.\"),\n",
    "  });"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "id": "082b0033",
   "metadata": {},
   "outputs": [],
   "source": [
    "const llmWithTools = llm.withStructuredOutput(tutorialSearch, {\n",
    "  name: \"TutorialSearch\",\n",
    "})\n",
    "const queryAnalyzer = prompt.pipe(llmWithTools);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "id": "15a399c3-e4c7-4cae-9f9e-ba22cf6cfcdc",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "title_search: \"LangChain\"\n",
      "sort_by: \"publish_date\"\n",
      "sort_order: \"descending\"\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke(\n",
    "    {\"question\": \"What has LangChain released lately?\"}\n",
    ");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "id": "0e613fa2-7be4-45ba-bc1a-8f1f02379d94",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sort_by: \"length\"\n",
      "sort_order: \"descending\"\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke({\"question\": \"What are the longest videos?\"});"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1893ee0-4760-4b39-986f-770be50f0d0e",
   "metadata": {},
   "source": [
    "We can even support searching and sorting together. This might look like first retrieving all results above a relevancy threshold and then sorting them according to the specified attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "id": "8d0285dc-a78f-4be5-b50c-a99f8137a5fa",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "content_search: \"agents\"\n",
      "sort_by: \"length\"\n",
      "sort_order: \"ascending\"\n"
     ]
    }
   ],
   "source": [
    "await queryAnalyzer.pipe(prettyPrintRunnable).invoke(\n",
    "    {\"question\": \"What are the shortest videos about agents?\"}\n",
    ");"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Deno",
   "language": "typescript",
   "name": "deno"
  },
  "language_info": {
   "file_extension": ".ts",
   "mimetype": "text/x.typescript",
   "name": "typescript",
   "nb_converter": "script",
   "pygments_lexer": "typescript",
   "version": "5.3.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
